Webscraping with tidyverse Packages


Sam Tyner
(co-organizer R-Ladies Ames)

9 Feb 2017

Outline

  1. Introduction
    • What is webscraping?
    • Why webscrape?
  2. Webscraping in R
    • Available packages (other than tidyverse)
  3. The tidyverse?
  4. rvest quick start guide
    • Your Turn #1
  5. Deeper dive into rvest
    • Key functions
    • Your Turn #2
  6. Advanced Examples

Introduction

What is webscraping?

  • Extract data from websites
    • Tables
    • Links to other websites
    • Text

Why webscrape?

  • Because copy-paste is awful
  • Because it’s fast
  • Because you can automate it

Resources for Webscraping in R

R Packages for Webscraping

Lots to choose from: XML, XML2R, scrapeR, selectr, rjson, RSelenium, etc.

Many more (and links to the above) on the Web Technologies CRAN Task View

But, we’ll be using the tidyverse packages rvest and xml2

What is the tidyverse?

The Tidy Tools Manifesto

“The tidyverse is a set of packages that work in harmony…. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.” - RStudio Blog

  1. Reuse existing data structures. (i.e. stick with data frames!)
  2. Compose simple functions with the pipe. (Each function does one simple thing well.)
  3. Embrace functional programming. (OOPers may find this difficult. If you are totally lost, you’ll be fine.)
  4. Design for humans. (Code should be understood by humans first, then computers)
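
To make principle 2 concrete, here is a minimal sketch of the pipe. The numbers are made up and the functions are plain base R; it only illustrates the idea and is not taken from the manifesto itself.

```r
library(magrittr)  # provides the %>% pipe

x <- c(4, 9, 16)  # made-up example data

# nested calls have to be read inside-out
nested <- round(mean(sqrt(x)), 1)

# the same three steps with the pipe read top to bottom
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Each step does one small thing, and the pipe passes the result on as the first argument of the next function.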

Familiar Friends

You may already have used:

  • ggplot2 for visualization
  • dplyr for data manipulation
  • tidyr for data tidying

Install all tidyverse packages in one fell swoop:

# check if you already have it
library(tidyverse)
# if not:
install.packages("tidyverse")
library(tidyverse) # attaches only the "core" tidyverse packages

tidyverse packages for web data

  • httr: for web APIs (Application Programming Interface)
  • jsonlite: for JSON (JavaScript Object Notation) data from the web
  • xml2: for XML (eXtensible Markup Language) structured data
  • rvest: package of wrapper functions around xml2 and httr for easy web scraping

We’ll focus on rvest

Webscraping with rvest: Step-by-Step Start Guide

Step 1: Find a URL

What data do you want?

  • Information on Oscar-nominated film Moonlight

Find it on the web!

# character variable containing the url you want to scrape
myurl <- "http://www.imdb.com/title/tt4975722/"

Step 2: Read HTML into R

“Huh? What am I doing?” - some of you right now

  • HTML is HyperText Markup Language. All webpages are written with it.
  • Go to any website, right click, click “View Page Source” to see the HTML
library(tidyverse)
library(rvest)
myhtml <- read_html(myurl)
myhtml
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...

Step 3: Figure out where your data is

Need to find your data within the myhtml object.

Tags to look for:

  • <p>: paragraphs
  • <h1>, <h2>, etc.: headers
  • <a>: links
  • <li>: item in a list
  • <table>: tables

Use Selector Gadget to find the exact location. (Demo)

For more on HTML, I recommend W3schools’ tutorial.

You don’t need to be an expert in HTML to webscrape with rvest!
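
To see how the tags listed above map onto rvest calls, here is a small sketch. The page is a made-up HTML string (not the real IMDb page), so it runs without an internet connection.

```r
library(rvest)

# a tiny made-up page, so no internet connection is needed
toy_html <- read_html('
  <html><body>
    <h1>Moonlight</h1>
    <p>A film from 2016. See the <a href="/cast">cast list</a>.</p>
    <ul><li>Drama</li><li>Independent</li></ul>
  </body></html>')

header_text <- toy_html %>% html_nodes("h1") %>% html_text()  # header text
list_items  <- toy_html %>% html_nodes("li") %>% html_text()  # one element per list item
link_text   <- toy_html %>% html_nodes("a") %>% html_text()   # text of the link

list_items
```

On a real page there are usually many more matches per tag, which is where Selector Gadget helps narrow things down.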

Step 4: Tell rvest where to find your data

Pass the selector you copied from Selector Gadget, or an HTML tag name, to html_nodes() to extract your data of interest.

myhtml %>% html_nodes(".summary_text") %>% html_text()
## [1] "\n                    A young African-American man grapples with his identity and sexuality while experiencing the everyday struggles of childhood, adolescence, and burgeoning adulthood.\n            "
myhtml %>% html_nodes("table") %>% html_table(header = TRUE)
## [[1]]
##    Cast overview, first billed only: Cast overview, first billed only:
## 1                                 NA                    Mahershala Ali
## 2                                 NA                      Shariff Earp
## 3                                 NA                    Duan Sanderson
## 4                                 NA                   Alex R. Hibbert
## 5                                 NA                     Janelle Monáe
## 6                                 NA                     Naomie Harris
## 7                                 NA                       Jaden Piner
## 8                                 NA            Herman 'Caheei McGloun
## 9                                 NA                  Kamal Ani-Bellow
## 10                                NA                      Keomi Givens
## 11                                NA                   Eddie Blanchard
## 12                                NA                       Rudi Goblen
## 13                                NA                    Ashton Sanders
## 14                                NA                        Edson Jean
## 15                                NA                    Patrick Decile
##    Cast overview, first billed only:
## 1                                ...
## 2                                ...
## 3                                ...
## 4                                ...
## 5                                ...
## 6                                ...
## 7                                ...
## 8                                ...
## 9                                ...
## 10                               ...
## 11                               ...
## 12                               ...
## 13                               ...
## 14                               ...
## 15                               ...
##                        Cast overview, first billed only:
## 1                                                   Juan
## 2                                               Terrence
## 3            Azu \n  \n  \n  (as Duan 'Sandy' Sanderson)
## 4                   Little \n  \n  \n  (as Alex Hibbert)
## 5                                                 Teresa
## 6                                                  Paula
## 7                                            Kevin age 9
## 8  Longshoreman \n  \n  \n  (as Herman 'Caheej' McGloun)
## 9                                         Portable Boy 1
## 10                                        Portable Boy 2
## 11                                        Portable Boy 3
## 12                      Gee \n  \n  \n  (as Rudi Goblin)
## 13                                                Chiron
## 14                                            Mr. Pierce
## 15                                                Terrel

Step 5: Save & tidy data

library(stringr)
library(magrittr)
mydat <- myhtml %>% 
  html_nodes("table") %>%
  extract2(1) %>% 
  html_table(header = TRUE)
mydat <- mydat[,c(2,4)]
names(mydat) <- c("Actor", "Role")
mydat <- mydat %>% 
  mutate(Actor = Actor,
         Role = str_replace_all(Role, "\n  ", ""))
mydat
##                     Actor                                      Role
## 1          Mahershala Ali                                      Juan
## 2            Shariff Earp                                  Terrence
## 3          Duan Sanderson           Azu (as Duan 'Sandy' Sanderson)
## 4         Alex R. Hibbert                  Little (as Alex Hibbert)
## 5           Janelle Monáe                                    Teresa
## 6           Naomie Harris                                     Paula
## 7             Jaden Piner                               Kevin age 9
## 8  Herman 'Caheei McGloun Longshoreman (as Herman 'Caheej' McGloun)
## 9        Kamal Ani-Bellow                            Portable Boy 1
## 10           Keomi Givens                            Portable Boy 2
## 11        Eddie Blanchard                            Portable Boy 3
## 12            Rudi Goblen                      Gee (as Rudi Goblin)
## 13         Ashton Sanders                                    Chiron
## 14             Edson Jean                                Mr. Pierce
## 15         Patrick Decile                                    Terrel
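
The str_replace_all() call above depends on the exact "\n  " pattern in this table. As an alternative sketch (not what the slides use), stringr's str_squish() collapses any run of whitespace. The two rows below are made up to mimic the messy Role column.

```r
library(stringr)

# made-up rows that mimic the messy Role column above
toy <- data.frame(
  Actor = c("Duan Sanderson", "Mahershala Ali"),
  Role  = c("Azu \n  \n  \n  (as Duan 'Sandy' Sanderson)", "Juan"),
  stringsAsFactors = FALSE
)

# str_squish() trims both ends and collapses each run of whitespace to one space
toy$Role <- str_squish(toy$Role)
toy$Role
```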

Your Turn #1

Using rvest, scrape a table from Wikipedia. You can pick your own table or you can get one of the tables in the country GDP per capita example from earlier.

Your result should be a data frame with one observation per row and one variable per column.

Your Turn #1 Solution

library(tidyverse) # for mutate() and parse_number()
library(rvest)
library(magrittr)  # for extract2()
myurl <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita"
myhtml <- read_html(myurl)
myhtml %>% 
 html_nodes("table") %>%
 extract2(3) %>%
 html_table(header = TRUE, fill = TRUE) %>% 
 mutate(`Int$` = parse_number(`Int$`)) %>% 
 head()
##   Rank Country/Territory   Int$
## 1    1             Qatar 138910
## 2    —             Macau 113352
## 3    2        Luxembourg 112045
## 4    3         Singapore 105689
## 5    4           Ireland  86988
## 6    5            Brunei  85011

Deeper dive into rvest

Key Functions: html_nodes

  • html_nodes(x, "path") extracts all elements from the page x that match path, a CSS selector for a tag, class, or id. (Use SelectorGadget to determine path.)
  • html_node() does the same thing but only returns the first matching element.
  • Can be chained
myhtml %>% 
  html_nodes("p") %>% # first get all the paragraphs 
  html_nodes("a") # then get all the links in those paragraphs
## {xml_nodeset (39)}
##  [1] <a href="/wiki/Gross_domestic_product" title="Gross domestic product">gr ...
##  [2] <a href="/wiki/Purchasing_power_parity" title="Purchasing power parity"> ...
##  [3] <a href="/wiki/Goods_and_services" title="Goods and services">goods and  ...
##  [4] <a href="/wiki/International_dollar" title="International dollar">Int$</a>
##  [5] <a href="#cite_note-world-2019-3">[n 1]</a>
##  [6] <a href="/wiki/List_of_countries_by_wealth_per_adult" title="List of cou ...
##  [7] <a href="/wiki/Gross_domestic_product" title="Gross domestic product">gr ...
##  [8] <a href="/wiki/Per_capita" title="Per capita">per capita</a>
##  [9] <a href="/wiki/IMF" class="mw-redirect" title="IMF">IMF</a>
## [10] <a href="/wiki/World_Bank" title="World Bank">World Bank</a>
## [11] <a href="/wiki/Savings" class="mw-redirect" title="Savings">savings</a>
## [12] <a href="/wiki/Cost_of_living" title="Cost of living">cost of living</a>
## [13] <a href="/wiki/List_of_countries_by_GDP_(nominal)_per_capita" title="Lis ...
## [14] <a href="https://en.wiktionary.org/wiki/generalized" class="extiw" title ...
## [15] <a href="/wiki/Living_standards" class="mw-redirect" title="Living stand ...
## [16] <a href="/wiki/Inflation_rates" class="mw-redirect" title="Inflation rat ...
## [17] <a href="/wiki/Exchange_rates" class="mw-redirect" title="Exchange rates ...
## [18] <a href="#cite_note-4">[3]</a>
## [19] <a href="#cite_note-5">[4]</a>
## [20] <a href="/wiki/Personal_income" title="Personal income">personal income</a>
## ...
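
The difference between html_nodes() and html_node() is easiest to see on a tiny example. This sketch uses a made-up HTML string rather than the Wikipedia page.

```r
library(rvest)

# made-up paragraph with two links
toy_html <- read_html('<p>See <a href="/a">first</a> and <a href="/b">second</a>.</p>')

all_links  <- toy_html %>% html_nodes("a")  # every match, as a nodeset
first_link <- toy_html %>% html_node("a")   # only the first match

length(all_links)
html_text(first_link)
```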

Key Functions: html_text

  • html_text(x) extracts all text from the nodeset x
  • Good for cleaning output
myhtml %>% 
  html_nodes("p") %>% # first get all the paragraphs 
  html_nodes("a") %>% # then get all the links in those paragraphs
  html_text() # get the linked text only 
##  [1] "gross domestic product"                       
##  [2] "purchasing power parity"                      
##  [3] "goods and services"                           
##  [4] "Int$"                                         
##  [5] "[n 1]"                                        
##  [6] "list of countries by wealth per adult"        
##  [7] "gross domestic product"                       
##  [8] "per capita"                                   
##  [9] "IMF"                                          
## [10] "World Bank"                                   
## [11] "savings"                                      
## [12] "cost of living"                               
## [13] "List of countries by GDP (nominal) per capita"
## [14] "generalized"                                  
## [15] "living standards"                             
## [16] "inflation rates"                              
## [17] "exchange rates"                               
## [18] "[3]"                                          
## [19] "[4]"                                          
## [20] "personal income"                              
## [21] "Standard of living and GDP"                   
## [22] "international dollars"                        
## [23] "rounded"                                      
## [24] "whole number"                                 
## [25] "economies"                                    
## [26] "sovereign states"                             
## [27] "dependent territories"                        
## [28] "tax havens"                                   
## [29] "corporate tax havens"                         
## [30] "[9]"                                          
## [31] "tax haven lists"                              
## [32] "[10]"                                         
## [33] "[11]"                                         
## [34] "leprechaun economics"                         
## [35] "BEPS"                                         
## [36] "modified gross national income"               
## [37] "corporate tax havens"                         
## [38] "major global tax havens"                      
## [39] "GDP-per-capita tax haven proxy"
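
html_text() also has a trim argument, which helps with padded text like the IMDb summary from Step 4. A minimal sketch on a made-up snippet (the class name is borrowed from that example, the text is shortened):

```r
library(rvest)

# made-up snippet with leading and trailing whitespace around the text
toy_html <- read_html('<div class="summary_text">\n     A young man grapples with his identity.\n   </div>')

raw_text   <- toy_html %>% html_nodes(".summary_text") %>% html_text()
clean_text <- toy_html %>% html_nodes(".summary_text") %>% html_text(trim = TRUE)

nchar(raw_text)    # includes the padding
clean_text         # padding removed
```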

Key Functions: html_table

  • html_table(x, header, fill) - parse html table(s) from x into a data frame or list of data frames
  • Structure of HTML makes finding and extracting tables easy!
myhtml %>% 
  html_nodes("table") %>% # get the tables 
  head(2) # look at first 2
## {xml_nodeset (2)}
## [1] <table width="100%"><tbody><tr>\n<td valign="top"> <div class="legend" st ...
## [2] <table style="font-size:100%;"><tbody>\n<tr>\n<td width="30%" align="cent ...
myhtml %>% 
  html_nodes("table") %>% # get the tables 
  extract2(3) %>% # pick the third one to parse
  html_table(header = TRUE) # parse table 
##     Rank                 Country/Territory    Int$
## 1      1                             Qatar 138,910
## 2      —                             Macau 113,352
## 3      2                        Luxembourg 112,045
## 4      3                         Singapore 105,689
## 5      4                           Ireland  86,988
## 6      5                            Brunei  85,011
## 7      6                            Norway  79,638
## 8      7              United Arab Emirates  70,441
## 9      8                            Kuwait  67,891
## 10     9                       Switzerland  67,558
## 11    10                     United States  67,426
## 12     —                         Hong Kong  66,527
## 13    11                        San Marino  62,913
## 14    12                       Netherlands  60,299
## 15     —                            Taiwan  57,214
## 16    13                           Iceland  56,974
## 17    14                      Saudi Arabia  56,912
## 18    15                            Sweden  55,989
## 19    16                           Denmark  55,675
## 20    17                           Germany  55,306
## 21    18                           Austria  55,171
## 22    19                         Australia  54,799
## 23    20                            Canada  52,144
## 24    21                           Bahrain  51,991
## 25    22                           Belgium  50,904
## 26    23                             Malta  49,589
## 27    24                           Finland  49,548
## 28    25                            France  48,640
## 29    26                              Oman  48,593
## 30    27                    United Kingdom  48,169
## 31    28                             Japan  46,827
## 32    29                      Korea, South  46,452
## 33    30                             Spain  43,007
## 34    31                            Cyprus  42,956
## 35    32                       New Zealand  42,045
## 36    33                             Italy  41,582
## 37     —                       Puerto Rico  41,198
## 38    34                    Czech Republic  40,585
## 39    35                          Slovenia  40,344
## 40    36                            Israel  40,337
## 41    37                         Lithuania  38,751
## 42    38                          Slovakia  38,321
## 43    39                           Estonia  37,606
## 44    40                           Hungary  35,941
## 45    41                            Poland  35,651
## 46    42                          Portugal  34,936
## 47    43                          Malaysia  34,567
## 48    44               Trinidad and Tobago  33,713
## 49    45                      Bahamas, The  33,432
## 50    46                        Seychelles  33,118
## 51    47                            Latvia  32,987
## 52    48             Saint Kitts and Nevis  31,950
## 53    49                            Greece  31,616
## 54    50                            Russia  30,820
## 55    51               Antigua and Barbuda  30,593
## 56    52                        Kazakhstan  30,178
## 57    53                           Romania  29,555
## 58    54                            Turkey  29,327
## 59    55                           Croatia  29,207
## 60    56                            Panama  28,456
## 61    57                             Chile  27,150
## 62    58                         Mauritius  26,461
## 63    59                          Bulgaria  26,034
## 64    60                          Maldives  24,796
## 65    61                           Uruguay  24,516
## 66    62                        Montenegro  21,977
## 67    63                      Turkmenistan  21,855
## 68    64                            Mexico  21,363
## 69    65                          Thailand  21,361
## 70    66                           Belarus  21,224
## 71    67        People's Republic of China  20,984
## 72    68                Dominican Republic  20,625
## 73    69                         Argentina  19,971
## 74    70                 Equatorial Guinea  19,961
## 75    71                             Gabon  19,839
## 76    72                            Serbia  19,767
## 77    73                          Botswana  19,388
## 78    74                          Barbados  19,364
## 79    75                        Azerbaijan  19,156
## 80    76                              Iraq  18,755
## 81    77                        Costa Rica  18,651
## 82     —                        World[n 2]  18,391
## 83    78                              Iran  17,832
## 84    79                           Grenada  17,434
## 85    80                   North Macedonia  17,378
## 86    81                            Guyana  17,163
## 87    82                            Brazil  17,106
## 88    83                             Palau  16,855
## 89    84                          Colombia  16,265
## 90    85                           Algeria  16,091
## 91    86                          Suriname  16,044
## 92    87                           Lebanon  15,599
## 93    88                              Peru  15,399
## 94    89                       Saint Lucia  15,159
## 95    90                          Mongolia  15,089
## 96    91            Bosnia and Herzegovina  14,894
## 97    92                           Albania  14,866
## 98    93                         Indonesia  14,841
## 99    94                             Egypt  14,800
## 100   95                         Sri Lanka  14,509
## 101   96                      South Africa  13,965
## 102   97                          Paraguay  13,213
## 103   98                           Georgia  13,200
## 104   99                           Tunisia  13,093
## 105    —                            Kosovo  13,017
## 106  100  Saint Vincent and the Grenadines  12,983
## 107  101                          Dominica  12,851
## 108  102                              Fiji  12,689
## 109  103                           Ecuador  11,866
## 110  104                           Armenia  11,845
## 111  105                           Namibia  11,451
## 112  106                          Eswatini  11,139
## 113  107                            Bhutan  10,627
## 114  108                           Ukraine  10,130
## 115  109                       Philippines  10,094
## 116  110                            Jordan   9,939
## 117  111                           Jamaica   9,932
## 118  112                           Morocco   9,667
## 119  113                        Uzbekistan   9,595
## 120  114                             Libya   9,446
## 121  115                             Nauru   9,073
## 122  116                             India   9,027
## 123  117                         Guatemala   9,009
## 124  118                            Belize   8,791
## 125  119                        Cape Verde   8,716
## 126  120                              Laos   8,684
## 127  121                           Vietnam   8,677
## 128  122                       El Salvador   8,593
## 129  123                           Bolivia   8,525
## 130  124                           Moldova   8,161
## 131  125            Congo, Republic of the   7,336
## 132  126                             Ghana   7,343
## 133  127                           Myanmar   7,220
## 134  128                             Tonga   6,867
## 135  129                            Angola   6,763
## 136  130                             Samoa   6,493
## 137  131                           Nigeria   6,172
## 138  132                          Pakistan   6,016
## 139  133                          Djibouti   5,855
## 140  134                          Honduras   5,600
## 141  135                        Bangladesh   5,453
## 142  136                       Timor-Leste   5,321
## 143  137                         Nicaragua   5,297
## 144  138                        Mauritania   5,158
## 145  139                          Cambodia   5,004
## 146  140                     Côte d'Ivoire   4,754
## 147  141                            Tuvalu   4,535
## 148  142                        Kyrgyzstan   4,193
## 149  143                            Zambia   4,174
## 150  144                          Cameroon   4,099
## 151  145                  Papua New Guinea   4,081
## 152  146                           Senegal   4,079
## 153  147                             Kenya   4,078
## 154  148                             Sudan   3,986
## 155  149                  Marshall Islands   3,972
## 156  150                        Tajikistan   3,751
## 157  151   Micronesia, Federated States of   3,657
## 158  152                           Lesotho   3,655
## 159  153                          Tanzania   3,652
## 160  154                             Benin   3,648
## 161  155                             Nepal   3,550
## 162  156             São Tomé and Príncipe   3,499
## 163  157                           Vanuatu   3,039
## 164  158                           Comoros   2,898
## 165  159                       Gambia, The   2,892
## 166  160                          Zimbabwe   2,778
## 167  161                            Uganda   2,753
## 168  162                          Ethiopia   2,702
## 169  163                            Rwanda   2,642
## 170  164                              Chad   2,603
## 171  165                            Guinea   2,574
## 172  166                              Mali   2,569
## 173  167                   Solomon Islands   2,363
## 174  168                             Yemen   2,312
## 175  169                          Kiribati   2,193
## 176  170                       Afghanistan   2,182
## 177  171                      Burkina Faso   2,181
## 178  172                     Guinea-Bissau   2,113
## 179  173                             Haiti   1,916
## 180  174                              Togo   1,913
## 181  175                        Madagascar   1,776
## 182  176                      Sierra Leone   1,765
## 183  177                       South Sudan   1,715
## 184  178                           Liberia   1,428
## 185  179                        Mozambique   1,372
## 186  180                            Malawi   1,292
## 187  181                             Niger   1,152
## 188  182                           Eritrea   1,103
## 189  183 Congo, Democratic Republic of the     873
## 190  184          Central African Republic     864
## 191  185                           Burundi     724
## 192    —                             Syria     n/a
## 193    —                         Venezuela     n/a

Key functions: html_attrs

  • html_attrs(x) - extracts all attribute elements from a nodeset x
  • html_attr(x, name) - extracts the attribute called name (e.g. "href") from every element in nodeset x
  • Attributes are things in the HTML like href, title, class, style, etc.
  • Use these functions to find and extract your data
myhtml %>% 
  html_nodes("table") %>% extract2(2) %>%
  html_attrs()
##             style 
## "font-size:100%;"
myhtml %>% 
  html_nodes("p") %>% html_nodes("a") %>%
  html_attr("href")
##  [1] "/wiki/Gross_domestic_product"                                                                  
##  [2] "/wiki/Purchasing_power_parity"                                                                 
##  [3] "/wiki/Goods_and_services"                                                                      
##  [4] "/wiki/International_dollar"                                                                    
##  [5] "#cite_note-world-2019-3"                                                                       
##  [6] "/wiki/List_of_countries_by_wealth_per_adult"                                                   
##  [7] "/wiki/Gross_domestic_product"                                                                  
##  [8] "/wiki/Per_capita"                                                                              
##  [9] "/wiki/IMF"                                                                                     
## [10] "/wiki/World_Bank"                                                                              
## [11] "/wiki/Savings"                                                                                 
## [12] "/wiki/Cost_of_living"                                                                          
## [13] "/wiki/List_of_countries_by_GDP_(nominal)_per_capita"                                           
## [14] "https://en.wiktionary.org/wiki/generalized"                                                    
## [15] "/wiki/Living_standards"                                                                        
## [16] "/wiki/Inflation_rates"                                                                         
## [17] "/wiki/Exchange_rates"                                                                          
## [18] "#cite_note-4"                                                                                  
## [19] "#cite_note-5"                                                                                  
## [20] "/wiki/Personal_income"                                                                         
## [21] "/wiki/Gross_domestic_product#Standard_of_living_and_GDP:_Wealth_distribution_and_externalities"
## [22] "/wiki/International_dollar"                                                                    
## [23] "/wiki/Rounding"                                                                                
## [24] "/wiki/Integer"                                                                                 
## [25] "/wiki/Economy"                                                                                 
## [26] "/wiki/Sovereign_state"                                                                         
## [27] "/wiki/Dependent_territories"                                                                   
## [28] "/wiki/Tax_havens"                                                                              
## [29] "/wiki/Corporate_tax_haven"                                                                     
## [30] "#cite_note-qqtz-11"                                                                            
## [31] "/wiki/Tax_haven#Tax_haven_lists"                                                               
## [32] "#cite_note-dhar-12"                                                                            
## [33] "#cite_note-imfx-13"                                                                            
## [34] "/wiki/Leprechaun_economics"                                                                    
## [35] "/wiki/BEPS"                                                                                    
## [36] "/wiki/Modified_gross_national_income"                                                          
## [37] "/wiki/Corporate_tax_haven"                                                                     
## [38] "/wiki/Tax_haven#Top_10_tax_havens"                                                             
## [39] "/wiki/Corporate_haven#GDP-per-capita_tax_haven_proxy"
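
Most of these links are relative ("/wiki/IMF"), so they need the site's address in front before you can read them with read_html(). A possible sketch uses url_absolute() from xml2, here on a made-up snippet with the Wikipedia page as the base URL.

```r
library(rvest)
library(xml2)

base_url <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita"

# made-up paragraph with one relative and one absolute link
toy_html <- read_html('<p><a href="/wiki/IMF">IMF</a> and <a href="https://en.wiktionary.org/wiki/generalized">generalized</a></p>')

links <- toy_html %>% html_nodes("a") %>% html_attr("href")

# resolve each link against the page it came from
full_links <- url_absolute(links, base_url)
full_links
```

Links that are already absolute are left as they are.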

Other functions

  • html_children - list the “children” of the HTML page. Can be chained like html_nodes
  • html_name - gives the tags of a nodeset. Use in a chain with html_children
myhtml %>% 
  html_children() %>% 
  html_name()
## [1] "head" "body"
  • html_form - parses HTML forms (checkboxes, fill-in-the-blanks, etc.)
  • html_session - simulate a session in a web browser; use the functions jump_to() and back() to navigate between pages (rvest 1.0 renamed these to session(), session_jump_to(), and session_back())

Your Turn #2

Find another website you want to scrape (ideas: all bills in the house so far this year, video game reviews, anything Wikipedia) and use at least 3 different rvest functions in a chain to extract some data.

Advanced Examples: Into the Weeds

Example #1: Inaugural Addresses

The Data

  • The Avalon Project has most of the U.S. Presidential inaugural addresses.
  • Obama 2013, Trump 2017, Van Buren 1837, Buchanan 1857, Garfield 1881, and Coolidge 1925 are missing, but are easily found elsewhere. I have them saved as text files on GitHub.
  • Let’s scrape all of them from The Avalon Project!

Get data frame of addresses

  • Could use another source to get this data of President names and years of inaugurations, but we’ll use The Avalon Project’s site because it’s a good example of data that needs tidying.
url <- "http://avalon.law.yale.edu/subject_menus/inaug.asp"
# even though it's called "all inaugs" some are missing
all_inaugs <- (url %>% 
  read_html() %>% 
  html_nodes("table") %>% 
  html_table(fill = TRUE, header = TRUE)) %>% extract2(3)
# table of addresses
all_inaugs_tidy <- all_inaugs %>% 
  gather(term, year, -President) %>% 
  filter(!is.na(year)) %>% 
  select(-term) %>% 
  arrange(year)
head(all_inaugs_tidy)
##           President year
## 1 George Washington 1789
## 2 George Washington 1793
## 3        John Adams 1797
## 4  Thomas Jefferson 1801
## 5  Thomas Jefferson 1805
## 6     James Madison 1809
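
If the gather() step is hard to picture, here is the same chain on a made-up wide table. The column names first and second are assumptions for the example; the real Avalon table has its own headers.

```r
library(tidyr)
library(dplyr)

# made-up wide table: one row per president, one column per term
wide <- data.frame(
  President = c("George Washington", "John Adams"),
  first  = c(1789, 1797),
  second = c(1793, NA),
  stringsAsFactors = FALSE
)

long <- wide %>%
  gather(term, year, -President) %>%  # stack the term columns into key/value pairs
  filter(!is.na(year)) %>%            # drop terms that never happened
  select(-term) %>%                   # the term label is no longer needed
  arrange(year)
long
```

The result has one row per inauguration, which is the "one observation per row" shape we want.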

Automate scraping

  • A function to read the addresses and get the text of the speeches, with a catch for a read error
get_inaugurations <- function(url){
  # try() keeps one bad URL from stopping the whole run
  page <- try(read_html(url), silent = TRUE)
  if (inherits(page, "try-error")) {
    return(NA)
  }
  # reuse the page we already downloaded instead of reading it twice
  page %>%
    html_nodes("p") %>% 
    html_text()
}
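
The same "catch a read error" idea can also be sketched with possibly() from purrr, which wraps a function so that it returns a fallback value instead of stopping. This is an alternative, not the approach used in these slides. The helper name get_paragraphs is hypothetical, and an inline HTML string plus a non-existent file path stand in for a good and a bad URL so the example runs offline.

```r
library(rvest)
library(purrr)

# hypothetical helper: return the paragraph text of one page
get_paragraphs <- function(src) {
  read_html(src) %>% html_nodes("p") %>% html_text()
}

# possibly() returns a version of the function that yields `otherwise` on error
safe_get_paragraphs <- possibly(get_paragraphs, otherwise = NA_character_)

sources <- list(
  "<p>Fellow-Citizens</p>",      # inline HTML stands in for a working URL
  "no-such-file-for-demo.html"   # a path that does not exist, so read_html() fails
)

results <- map(sources, safe_get_paragraphs)
results
```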

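# takes about 30 secs to run
# note: `url` here must be a column of all_inaugs_tidy holding the link to
# each address; the step that builds that column is not shown on these slides
all_inaugs_text <- all_inaugs_tidy %>% 
  mutate(address_text = map(url, get_inaugurations))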

all_inaugs_text$address_text[[1]]
## [1] " Fellow-Citizens of the Senate and of the House of Representatives: "
## [2] "Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years--a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected. All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow-citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated. 
"                                                                                                                                                                                                                                                                                                                                                                                           
## [3] "Such being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow- citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States. Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage. These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed. You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence. 
"                                                                                                                                                                                                                                                                                                                                              
## [4] "By the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\" The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me, to substitute, in place of a recommendation of particular measures, the tribute that is due to the talents, the rectitude, and the patriotism which adorn the characters selected to devise and adopt them. In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments, no separate views nor party animosities, will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests, so, on another, that the foundation of our national policy will be laid in the pure and immutable principles of private morality, and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world. 
I dwell on this prospect with every satisfaction which an ardent love for my country can inspire, since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness; between duty and advantage; between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity; since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained; and since the preservation of the sacred fire of liberty and the destiny of the republican model of government are justly considered, perhaps, as deeply, as finally, staked on the experiment entrusted to the hands of the American people. "
## [5] "Besides the ordinary objects submitted to your care, it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system, or by the degree of inquietude which has given birth to them. Instead of undertaking particular recommendations on this subject, in which I could be guided by no lights derived from official opportunities, I shall again give way to my entire confidence in your discernment and pursuit of the public good; for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government, or which ought to await the future lessons of experience, a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted. 
"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [6] "To the foregoing observations I have one to add, which will be most properly addressed to the House of Representatives. It concerns myself, and will therefore be as brief as possible. When I was first honored with a call into the service of my country, then on the eve of an arduous struggle for its liberties, the light in which I contemplated my duty required that I should renounce every pecuniary compensation. From this resolution I have in no instance departed; and being still under the impressions which produced it, I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department, and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require. "
## [7] "Having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together, I shall take my present leave; but not without resorting once more to the benign Parent of the Human Race in humble supplication that, since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity, and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness, so His divine blessing may be equally conspicuous in the enlarged views, the temperate consultations, and the wise measures on which the success of this Government must depend. "

Add Missings

all_inaugs_text$President[is.na(all_inaugs_text$address_text)]
## [1] "Martin Van Buren"  "James Buchanan"    "James A. Garfield"
## [4] "Calvin Coolidge"
# 7 speeches are missing at this point: Obama's two and Trump's, plus Coolidge, Garfield, Buchanan, and Van Buren, which errored during scraping.
obama09 <- get_inaugurations("http://avalon.law.yale.edu/21st_century/obama.asp")
obama13 <- readLines("speeches/obama2013.txt")
trump17 <- readLines("speeches/trumpinaug.txt")
vanburen1837 <- readLines("speeches/vanburen1837.txt") # row 13
buchanan1857 <- readLines("speeches/buchanan1857.txt") # row 18
garfield1881 <- readLines("speeches/garfield1881.txt") # row 24
coolidge1925 <- readLines("speeches/coolidge1925.txt") # row 35
all_inaugs_text$address_text[c(13,18,24,35)] <- list(vanburen1837,buchanan1857, garfield1881, coolidge1925)

# let's combine them all now
recents <- data.frame(President = c(rep("Barack Obama", 2), 
                                    "Donald Trump"),
                      year = c(2009, 2013, 2017), 
                      url = NA,
                      address_text = NA,
                      stringsAsFactors = FALSE) # keep President as character, not factor

all_inaugs_text <- rbind(all_inaugs_text, recents)
# rows 56-58 are the three rows just appended
all_inaugs_text$address_text[56:58] <- list(obama09, obama13, trump17)

Check-in: What did we do?

  1. We found some interesting data to scrape from the web.
  2. We used tidy tools to create tidy data:
    • A data frame of President and year. One observation per row!
    • Stored the URLs we wished to scrape alongside that data
    • Stored the scraped speech with the matching President, year, and URL
  3. We used the consistent HTML structure of the pages we wanted to scrape to automate collection of web data
    • Way faster than copy-paste!
    • Though we had to do some by hand, we took advantage of the tidy data and added the missing data manually without much pain.
  4. We now have a tidy data set of Presidential inaugural addresses for text analysis!
    • Each variable forms a column
    • Each observation forms a row
    • Each type of observational unit forms a table
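
The "scrape everything, then patch the failures by hand" pattern can be sketched offline. This is only an illustration: `pages`, `get_paragraphs()`, and the inline HTML strings are hypothetical stand-ins for the real URLs and scraper, and wrapping the scraper in purrr's `possibly()` is an assumption about one way to record failures as `NA`, not necessarily what was done for the inaugural addresses.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical stand-ins for web pages: inline HTML instead of real URLs,
# so the sketch runs without a network connection.
# The second entry is NA to mimic a page that cannot be scraped.
pages <- tibble(
  speaker = c("A", "B", "C"),
  html = c("<html><body><p>First speech.</p><p>More text.</p></body></html>",
           NA,
           "<html><body><p>Third speech.</p></body></html>")
)

# Hypothetical scraper: returns the text of every <p> element, one per paragraph
get_paragraphs <- function(html) {
  if (is.na(html)) stop("nothing to scrape")
  html %>% read_html() %>% html_nodes("p") %>% html_text()
}

# possibly() returns NA instead of stopping when the scraper errors
safe_get <- possibly(get_paragraphs, otherwise = NA_character_)

# One list element per row: the scraped text stays next to its speaker
pages$text <- map(pages$html, safe_get)

# Which rows still need to be filled in by hand?
pages$speaker[is.na(pages$text)]
```

Because the failures are stored as `NA` in the same data frame, the rows that need manual attention can be found with the same `is.na()` check used on the "Add Missings" slide.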

A (Small) Text Analysis

Now, I use the tidytext package to split each inaugural address into individual words, one word per row.

# install.packages("tidytext")
library(tidytext)
all_inaugs_text %>% 
  select(-url) %>% 
  unnest(address_text) %>% # name the list column explicitly; recent tidyr versions expect it
  unnest_tokens(word, address_text) -> presidential_words
head(presidential_words)
## # A tibble: 6 x 3
##   President          year word    
##   <chr>             <dbl> <chr>   
## 1 George Washington  1789 fellow  
## 2 George Washington  1789 citizens
## 3 George Washington  1789 of      
## 4 George Washington  1789 the     
## 5 George Washington  1789 senate  
## 6 George Washington  1789 and

Longest speeches

presidential_words %>% 
  group_by(President,year) %>% 
  summarize(num_words = n()) %>%
  arrange(desc(num_words)) -> presidential_wordtotals
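
To see what the tokenizing and counting steps do without the scraped data, here is a small sketch on made-up text. The `toy` data frame and its contents are invented for illustration, and `count()` is swapped in as a shortcut for the `group_by()` + `summarize()` + `arrange()` chain above.

```r
library(dplyr)
library(tidytext)

# Made-up stand-in for all_inaugs_text: one row per speech
toy <- tibble(
  President = c("A", "A", "B"),
  year = c(1801, 1805, 1809),
  address_text = c("We the People, in order.",
                   "Fellow citizens!",
                   "Four score and seven years ago")
)

# One row per word; unnest_tokens() lowercases and drops punctuation by default
toy_words <- toy %>%
  unnest_tokens(word, address_text)

# Words per speech, longest first
toy_totals <- toy_words %>%
  count(President, year, name = "num_words", sort = TRUE)
toy_totals
```

On the real data, the first row of `presidential_wordtotals` is the longest address.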

Example #2: Notable Deaths

The Data

  • 2016 felt to many people like a year of loss: David Bowie, Prince, Alan Rickman, Carrie Fisher, and many other celebrities passed away that year
  • But were there really more “celebrity deaths” than in any other year?
  • Wikipedia has a list of notable deaths every year, going all the way back to 1987.
  • We can scrape Wikipedia pages for this data.

Scraping Wikipedia

First, build the URLs of the Wikipedia articles for the years 1987-2016.

years <- 1987:2016
urls <- paste0("https://en.wikipedia.org/wiki/", years, "#Deaths")

Next, create a data frame to store all of the data.

celebDeaths <- data.frame(year = years, url = urls,
                          stringsAsFactors = FALSE)

Look at the HTML

urls[1] %>% read_html() %>% html_children() %>%
  html_nodes("h2")
## {xml_nodeset (7)}
## [1] <h2 id="mw-toc-heading">Contents</h2>\n
## [2] <h2>\n<span class="mw-headline" id="Events">Events</span><span class="mw- ...
## [3] <h2>\n<span class="mw-headline" id="Births">Births</span><span class="mw- ...
## [4] <h2>\n<span class="mw-headline" id="Deaths">Deaths</span><span class="mw- ...
## [5] <h2>\n<span class="mw-headline" id="Nobel_Prizes">Nobel Prizes</span><spa ...
## [6] <h2>\n<span class="mw-headline" id="References">References</span><span cl ...
## [7] <h2>Navigation menu</h2>
urls[1] %>% read_html() %>% html_children() %>%
  html_nodes("li")
## {xml_nodeset (1469)}
##  [1] <li><a href="/wiki/19th_century" title="19th century">19th century</a></li>
##  [2] <li><b><a href="/wiki/20th_century" title="20th century">20th century</a ...
##  [3] <li><a href="/wiki/21st_century" title="21st century">21st century</a></li>
##  [4] <li><a href="/wiki/1960s" title="1960s">1960s</a></li>
##  [5] <li><a href="/wiki/1970s" title="1970s">1970s</a></li>
##  [6] <li><b><a href="/wiki/1980s" title="1980s">1980s</a></b></li>
##  [7] <li><a href="/wiki/1990s" title="1990s">1990s</a></li>
##  [8] <li><a href="/wiki/2000s_(decade)" title="2000s (decade)">2000s</a></li>
##  [9] <li><a href="/wiki/1984" title="1984">1984</a></li>
## [10] <li><a href="/wiki/1985" title="1985">1985</a></li>
## [11] <li><a href="/wiki/1986" title="1986">1986</a></li>
## [12] <li><b><a class="mw-selflink selflink">1987</a></b></li>
## [13] <li><a href="/wiki/1988" title="1988">1988</a></li>
## [14] <li><a href="/wiki/1989" title="1989">1989</a></li>
## [15] <li><a href="/wiki/1990" title="1990">1990</a></li>
## [16] <li><a href="/wiki/1987_in_archaeology" title="1987 in archaeology">Arch ...
## [17] <li><a href="/wiki/1987_in_architecture" title="1987 in architecture">Ar ...
## [18] <li><a href="/wiki/1987_in_art" title="1987 in art">Art</a></li>
## [19] <li><a href="/wiki/1987_in_aviation" title="1987 in aviation">Aviation</ ...
## [20] <li><a href="/wiki/Category:1987_awards" title="Category:1987 awards">Aw ...
## ...

Start Scraping

  • Write a function for scraping all the years, just like with the Presidents’ inaugural addresses
  • Unfortunately, the lists aren’t as structured as the Wikipedia table
  • This creates some difficulties…
  • But, luckily, the same exact difficulties exist on each page, so we only have to deal with them once!
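
To make the difficulty concrete, here is a sketch on a toy page. The HTML string below is invented and only imitates the layout seen in the output above: the lists are not wrapped in their own container, so the list items we want are simply whatever sits between the "Deaths" header and the next `h2`. This variant looks up one id per header, which differs slightly from the function on the next slides.

```r
library(rvest)

# Invented miniature of a Wikipedia year page (not real data)
toy_page <- '<html><body><div id="mw-content-text"><div>
  <h2><span class="mw-headline" id="Births">Births</span><span>[edit]</span></h2>
  <ul><li>January 1 – Person A, example (d. 2000)</li></ul>
  <h2><span class="mw-headline" id="Deaths">Deaths</span><span>[edit]</span></h2>
  <h3>January</h3>
  <ul><li>January 2 – Person B, example (b. 1900)</li>
      <li>January 5 – Person C, example (b. 1910)</li></ul>
  <h2><span class="mw-headline" id="References">References</span><span>[edit]</span></h2>
  <ul><li>Some reference</li></ul>
</div></div></body></html>'

# Flat sequence of elements inside the content area
page <- toy_page %>% read_html() %>%
  html_nodes("#mw-content-text") %>% html_children() %>% html_children()

# Positions of the big section headers in that sequence
tagnames <- html_name(page)
h2s <- which(tagnames == "h2")

# One id per header, taken from its headline span
h2_ids <- sapply(h2s, function(i) {
  html_attr(html_node(page[[i]], "span.mw-headline"), "id")
})

# Everything between the "Deaths" header and the next header
k <- which(h2_ids == "Deaths")
death_start <- h2s[k]
death_end <- h2s[k + 1]
deaths <- page[(death_start + 1):(death_end - 1)] %>%
  html_nodes("li") %>% html_text()
deaths
```

The "Births" and "References" items are left out because they fall outside the two header positions.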

Write the function (1/2)

  • Heads up: this is a difficult example. Don’t worry if you don’t understand everything right away
  • Also, this is not the only way to solve this problem
get_deaths <- function(url){
  # get the main content page
  page <- url %>% read_html() %>% 
    html_nodes("#mw-content-text") %>% html_children() %>% 
    html_children()
  # get the names of all elements 
  tagnames <- page %>% html_name()
  # where are the big section headers
  h2s <- which(tagnames == "h2")
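  # to find the heading labeled "Deaths", collect the ids of the h2s' children
  # (this assumes each h2 has exactly two children: a headline span that
  # carries the id, followed by a second span without one)
  h2childids <- page[h2s] %>% html_children() %>% html_attr("id")
  idDeaths <- which(h2childids == "Deaths")
  # the ids sit at odd positions (1, 3, 5, ...), so (idDeaths+1)/2 is the
  # position of the "Deaths" header among the h2s
  # list of deaths starts after the location of deathStart and 
  # ends immediately before the location of deathEnd (next big header)
  deathStart <- h2s[(idDeaths+1)/2]
  deathEnd <- h2s[(idDeaths+1)/2+1]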
  # get the deaths
  death_elements <- page[(deathStart+1):(deathEnd-1)] 
  deaths <- death_elements %>% html_nodes("li") %>% html_text()

(continued on next slide)

Write the function (2/2)

  # there are two types of entries: (a) only one death is listed for that day in that year
  deathsa <- data.frame(death = deaths[grepl("–", deaths)], stringsAsFactors = FALSE)
  deathsa <- deathsa %>% 
    separate(death, into = c("Date", "Person"), sep = " – ") %>% 
    separate(Date, into = c("Month", "Day"), sep = " ") %>%
    separate(Person, into = c("Name", "Desc"), sep = ", ", extra = "merge") 
  # or (b) multiple deaths are listed for that day in that year
  # (!grepl() stays correct even when nothing matches, unlike -grep())
  deathsb <- data.frame(death = deaths[!grepl("–", deaths)], stringsAsFactors = FALSE)
  # remove repeats: keep only the entries that contain a newline
  deathsb <- data.frame(death = deathsb[grep("\n",deathsb$death),], stringsAsFactors = FALSE)
  # tidy up the data
  deathsb %>% 
    separate(death, into = c("Date", "Other"), sep = "\\n", extra="merge") %>%
    separate(Other, into = paste0("Person", 1:6), sep = "\\n", fill = "right") %>% 
    gather(Person, Desc, -Date) %>% 
    select(Date, Desc) %>%
    filter(!is.na(Desc)) -> deathsb
  deathsb %>% separate(Desc, into = c("Name", "Desc"), sep = ", ", extra = "merge") %>%
    separate(Date, into = c("Month", "Day"), sep = " ") %>%
    filter(!is.na(Desc)) -> deathsb
  #combine the 2 sets
  deaths <- rbind(deathsa, deathsb)

  return(deaths)
} 
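
The string-splitting steps are easier to follow on two made-up entries, one of each type. The names and descriptions below are invented, and `separate_rows()` is swapped in for the `separate()` + `gather()` combination used in the function; treat this as an alternative sketch rather than the function's exact method.

```r
library(tidyr)
library(dplyr)

# Type (a): one death on that date, "Date – Name, description"
toy_a <- tibble(death = c(
  "January 2 – Jean Example, French cyclist, coach (b. 1922)",
  "March 14 – Ann Sample, American actor (b. 1905)"
))
tidy_a <- toy_a %>%
  separate(death, into = c("Date", "Person"), sep = " – ") %>%
  separate(Date, into = c("Month", "Day"), sep = " ") %>%
  # extra = "merge" keeps any later commas inside Desc
  separate(Person, into = c("Name", "Desc"), sep = ", ", extra = "merge")

# Type (b): several deaths on one date, separated by newlines
toy_b <- tibble(death = "May 1\nPat Doe, singer (b. 1930)\nLee Roe, painter (b. 1941)")
tidy_b <- toy_b %>%
  separate(death, into = c("Date", "Other"), sep = "\\n", extra = "merge") %>%
  separate_rows(Other, sep = "\\n") %>%   # one row per person
  separate(Other, into = c("Name", "Desc"), sep = ", ", extra = "merge") %>%
  separate(Date, into = c("Month", "Day"), sep = " ")

# Both pieces now share the columns Month, Day, Name, Desc
toy_deaths <- bind_rows(tidy_a, tidy_b)
toy_deaths
```

With `separate_rows()` there is no need to guess the maximum number of people per date, which the `paste0("Person", 1:6)` step fixes at six.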

Use the function!

  • Use the same tidy principles we used for the inaugural example.
# should take about 10 seconds
celebDeaths <- celebDeaths %>% 
  mutate(Deaths = map(url, get_deaths)) %>%
  unnest(Deaths) # name the list column explicitly; recent tidyr versions expect it
head(celebDeaths[,-2])
## # A tibble: 6 x 5
##    year Month   Day   Name             Desc                                     
##   <int> <chr>   <chr> <chr>            <chr>                                    
## 1  1987 January 2     Jean de Gribaldy French road cyclist and directeur sporti…
## 2  1987 January 5     Herman Smith-Jo… Norwegian supercentenarian (b. 1875)     
## 3  1987 January 9     Arthur Lake      American actor (b. 1905)                 
## 4  1987 January 13    Turgut Demirağ   Turkish film producer, director and scre…
## 5  1987 January 14    Douglas Sirk     German-born film director (b. 1897)      
## 6  1987 January 15    Ray Bolger       American actor, singer, and dancer (b. 1…

Check-in: What did we do?

  1. We found some interesting data to scrape from the web.
  2. We used tidy tools to create tidy data:
    • Years and Wikipedia pages associated with them
    • Stored the scraped data with the matching year and URL
  3. We spent some time decoding the HTML & figuring out how to find where our data was stored
    • Struggled with lack of structure in the lists we wanted
    • Ours is not the only possible solution
  4. Wrote a function to scrape a page; applied it to each year in our data
  5. Output: A tidy data frame of one person per row with dates, names, and descriptions

A (Small) Data Analysis

  • We want to know whether 2016 really was an unusually heavy year for celebrity deaths
  • Let’s get a quick count
celebDeaths %>% 
  group_by(year) %>% 
  summarise(num_deaths = n()) %>% 
  arrange(desc(num_deaths)) %>% 
  head(10)
## # A tibble: 10 x 2
##     year num_deaths
##    <int>      <int>
##  1  2016        410
##  2  1989        362
##  3  2015        358
##  4  1992        313
##  5  2014        304
##  6  1988        294
##  7  1987        275
##  8  2013        258
##  9  1993        247
## 10  1996        238

Over time?

  • Some people have postulated that there is an increase in deaths because we are 50+ years out from the cultural revolution of the 1960s.
  • Let’s see if there’s a trend over time:
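
A minimal sketch of such a plot, assuming ggplot2. To keep it runnable without re-scraping, it uses only the ten counts printed in the table above, so it cannot show the full 1987-2016 trend; with the scraped data you would start from the per-year summary of `celebDeaths` instead.

```r
library(ggplot2)

# Counts copied from the top-10 table above; the other 20 years are not shown there
top10 <- data.frame(
  year = c(2016, 1989, 2015, 1992, 2014, 1988, 1987, 2013, 1993, 1996),
  num_deaths = c(410, 362, 358, 313, 304, 294, 275, 258, 247, 238)
)

# Points rather than a line, because the intermediate years are missing here
p <- ggplot(top10, aes(x = year, y = num_deaths)) +
  geom_point() +
  labs(x = "Year", y = "Notable deaths listed on Wikipedia")
p

# With the full scraped data (requires celebDeaths and dplyr):
# celebDeaths %>% count(year, name = "num_deaths") %>%
#   ggplot(aes(x = year, y = num_deaths)) + geom_line()
```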

Conclusion

What did we do?

  • Learned about webscraping and why you’d want to do it
  • Saw some resources for webscraping in R
  • Got to know the tidyverse
  • Scraped data from the web with rvest
  • Discovered the longest inaugural address given by a US President was over 8,000 words
  • Found out that 2016 really was a major year in celebrity deaths
  • Had fun!

Thank you!

  • Questions? We have the room until 6pm!